Housing Rental Analysis for San Francisco¶

Data visualisation skills (aggregation, interactive visualisations, and geospatial analysis) were used to find properties in the San Francisco market that are viable investment opportunities.

Instructions¶

The main tasks is to visualise and analyse the real-estate data in Jupyter notebook, include:

Calculate and Plot the Housing Units per Year¶

  • What’s the overall trend in housing units over the period that you’re analysing?

Calculate and Plot the Average Sale Prices per Square Foot¶

  • Did any year experience a drop in the average sale price per square foot compared to the previous year?

  • If so, did the gross rent increase or decrease during that year?

Compare the Average Sale Prices by Neighbourhood¶

  • For the Anza Vista neighbourhood, is the average sale price per square foot for 2016 more or less than the price that’s listed for 2012?

Build an Interactive Neighbourhood Map¶

  • Which neighbourhood has the highest gross rent, and which has the highest sale price per square foot?

Compose Data Story¶

  • How does the trend in rental income growth compare to the trend in sales prices? Does this same trend hold true for all the neighbourhoods across San Francisco?

  • What insights can you share with your company about the potential one-click, buy-and-rent strategy that they're pursuing? Do neighbourhoods exist that you would suggest for investment, and why?

In [1]:
# Import the required libraries and dependencies
import pandas as pd
import hvplot.pandas
from pathlib import Path

Import the data¶

In [62]:
# Using the read_csv function and Path module, create a DataFrame 
# by importing the sfo_neighbourhoods_census_data.csv file from the Resources folder
path = Path("./Resources/sfo_neighbourhoods_census_data.csv")
sfo_data_df = pd.read_csv(path).dropna()

# Review the first and last five rows of the DataFrame
sfo_data_df
Out[62]:
year neighbourhood sale_price_sqr_foot housing_units gross_rent
0 2010 Alamo Square 291.182945 372560 1239
1 2010 Anza Vista 267.932583 372560 1239
2 2010 Bayview 170.098665 372560 1239
3 2010 Buena Vista Park 347.394919 372560 1239
4 2010 Central Richmond 319.027623 372560 1239
... ... ... ... ... ...
392 2016 Telegraph Hill 903.049771 384242 4390
393 2016 Twin Peaks 970.085470 384242 4390
394 2016 Van Ness/ Civic Center 552.602567 384242 4390
395 2016 Visitacion Valley 328.319007 384242 4390
396 2016 Westwood Park 631.195426 384242 4390

392 rows × 5 columns


Calculate and Plot the Housing Units per Year¶

Step 1: Use the groupby function to group the data by year. Aggregate the results by the mean of the groups.¶

In [63]:
# Create a numerical aggregation that groups the data by the year and then averages the results.
df_by_year = sfo_data_df.groupby('year').mean()
df_by_year
Out[63]:
sale_price_sqr_foot housing_units gross_rent
year
2010 369.344353 372560.0 1239.0
2011 341.903429 374507.0 1530.0
2012 399.389968 376454.0 2324.0
2013 483.600304 378401.0 2971.0
2014 556.277273 380348.0 3528.0
2015 632.540352 382295.0 3739.0
2016 697.643709 384242.0 4390.0
In [64]:
housing_units_by_year = df_by_year['housing_units']

# Review the DataFrame
housing_units_by_year
Out[64]:
year
2010    372560.0
2011    374507.0
2012    376454.0
2013    378401.0
2014    380348.0
2015    382295.0
2016    384242.0
Name: housing_units, dtype: float64

Step 2: Use the hvplot function to plot the housing_units_by_year DataFrame as a bar chart. Make the x-axis represent the year and the y-axis represent the housing_units.¶

Step 3: Style and format the line plot to ensure a professionally styled visualisation.¶

In [65]:
# Create a visual aggregation explore the housing units by year
housing_units_by_year.hvplot.bar(x='year', y='housing_units', xlabel= 'Year', ylabel= 'Housing Units', ylim=(365000,385000)
                                ).opts(yformatter='%.0f', title = 'Housing units in San Francisco from 2010 to 2016')
Out[65]:

Step 5: Answer the following question:¶

Question: What is the overall trend in housing_units over the period being analysed?

Answer: From 2010 to 2016, the overall trend in housing units continued to increase


Calculate and Plot the Average Sale Prices per Square Foot¶

Step 1: Group the data by year, and then average the results.¶

In [66]:
# Create a numerical aggregation by grouping the data by year and averaging the results
prices_square_foot_by_year = df_by_year[['sale_price_sqr_foot','gross_rent']]

# Review the resulting DataFrame
prices_square_foot_by_year
Out[66]:
sale_price_sqr_foot gross_rent
year
2010 369.344353 1239.0
2011 341.903429 1530.0
2012 399.389968 2324.0
2013 483.600304 2971.0
2014 556.277273 3528.0
2015 632.540352 3739.0
2016 697.643709 4390.0
In [67]:
prices_square_foot_by_year.gross_rent.min()
Out[67]:
1239.0

Question: What is the lowest gross rent reported for the years included in the DataFrame?

Answer: 1239.0

Step 2: Use hvPlot to plot the prices_square_foot_by_year DataFrame as a line plot.¶

Step 3: Style and format the line plot to ensure a professionally styled visualisation.¶

In [68]:
# Plot prices_square_foot_by_year. 
# Inclued labels for the x- and y-axes, and a title.
prices_square_foot_by_year.hvplot.line(x='year', y= ['sale_price_sqr_foot','gross_rent'], xlabel= 'Year', ylabel = 'Gross Rent / Sale Price Per Square Foot'
                                ).opts(title = 'Sale Price Per Square Foot and Average Gross Rent - 2010-2016 - San Francisco')
Out[68]:

Step 6: Use both the prices_square_foot_by_year DataFrame and interactive plots to answer the following questions:¶

Question: Did any year experience a drop in the average sale price per square foot compared to the previous year?

Answer: In 2011, the average sale price per square foot decreased compare to the previous year.

Question: If so, did the gross rent increase or decrease during that year?

Answer: However, in this year (2011), the gross rent increased.


Compare the Average Sale Prices by Neighbourhood¶

Step 1: Create a new DataFrame that groups the original DataFrame by year and neighbourhood. Aggregate the results by the mean of the groups.¶

In [94]:
# Group by year and neighbourhood and then create a new dataframe of the mean values
prices_by_year_by_neighbourhood = sfo_data_df.groupby(['year','neighbourhood']).mean()

# Review the DataFrame
prices_by_year_by_neighbourhood.head()
Out[94]:
sale_price_sqr_foot housing_units gross_rent
year neighbourhood
2010 Alamo Square 291.182945 372560.0 1239.0
Anza Vista 267.932583 372560.0 1239.0
Bayview 170.098665 372560.0 1239.0
Buena Vista Park 347.394919 372560.0 1239.0
Central Richmond 319.027623 372560.0 1239.0
In [70]:
prices_by_year_by_neighbourhood.tail()
Out[70]:
sale_price_sqr_foot housing_units gross_rent
year neighbourhood
2016 Telegraph Hill 903.049771 384242.0 4390.0
Twin Peaks 970.085470 384242.0 4390.0
Van Ness/ Civic Center 552.602567 384242.0 4390.0
Visitacion Valley 328.319007 384242.0 4390.0
Westwood Park 631.195426 384242.0 4390.0

Step 2: Filter out the “housing_units” column to create a DataFrame that includes only the sale_price_sqr_foot and gross_rent averages per year.¶

In [95]:
# Filter out the housing_units
prices_by_year_by_neighbourhood = prices_by_year_by_neighbourhood.drop(['housing_units'], axis = 1)
In [96]:
# Review the first and last five rows of the DataFrame
prices_by_year_by_neighbourhood
Out[96]:
sale_price_sqr_foot gross_rent
year neighbourhood
2010 Alamo Square 291.182945 1239.0
Anza Vista 267.932583 1239.0
Bayview 170.098665 1239.0
Buena Vista Park 347.394919 1239.0
Central Richmond 319.027623 1239.0
... ... ... ...
2016 Telegraph Hill 903.049771 4390.0
Twin Peaks 970.085470 4390.0
Van Ness/ Civic Center 552.602567 4390.0
Visitacion Valley 328.319007 4390.0
Westwood Park 631.195426 4390.0

392 rows × 2 columns

Step 3: Create an interactive line plot with hvPlot that visualises both sale_price_sqr_foot and gross_rent. Set the x-axis parameter to the year (x="year"). Use the groupby parameter to create an interactive widget for neighbourhood.¶

Step 4: Style and format the line plot to ensure a professionally styled visualisation.¶

In [97]:
# Use hvplot to create an interactive line plot of the average price per square foot
# The plot should have a dropdown selector for the neighbourhood
prices_by_year_by_neighbourhood.hvplot.line(x='year', y= ['sale_price_sqr_foot','gross_rent'], groupby= 'neighbourhood',
                                            xlabel= 'Year', ylabel = 'Gross Rent / Sale Price Per Square Foot'
                                ).opts(title = 'Sale Price Per Square Foot and Average Gross Rent - 2010-2016 - By Neighborhood')
Out[97]:

Step 6: Use the interactive visualisation to answer the following question:¶

Question: For the Anza Vista neighbourhood, is the average sale price per square foot for 2016 more or less than the price that’s listed for 2012?

Answer: For the Anza Vista neighbourhood, the average sale price per square foot for 2016 is less than the price that’s listed for 2012.


Build an Interactive Neighbourhood Map¶

Step 1: Read the neighbourhoods_coordinates.csv file from the Resources folder into the notebook, and create a DataFrame named neighbourhood_locations_df. Be sure to set the index_col of the DataFrame as “Neighbourhood”.¶

In [73]:
# Load neighbourhoods coordinates data
neighbourhood_locations_df = pd.read_csv(Path('./Resources/neighbourhoods_coordinates.csv'), index_col = 'Neighbourhood')

# Review the DataFrame
neighbourhood_locations_df
Out[73]:
Lat Lon
Neighbourhood
Alamo Square 37.791012 -122.402100
Anza Vista 37.779598 -122.443451
Bayview 37.734670 -122.401060
Bayview Heights 37.728740 -122.410980
Bernal Heights 37.728630 -122.443050
... ... ...
West Portal 37.740260 -122.463880
Western Addition 37.792980 -122.435790
Westwood Highlands 37.734700 -122.456854
Westwood Park 37.734150 -122.457000
Yerba Buena 37.792980 -122.396360

73 rows × 2 columns

Step 2: Using the original sfo_data_df Dataframe, create a DataFrame named all_neighbourhood_info_df that groups the data by neighbourhood. Aggregate the results by the mean of the group.¶

In [75]:
# Calculate the mean values for each neighbourhood
all_neighbourhood_info_df = sfo_data_df.groupby('neighbourhood').mean()

# Review the resulting DataFrame
all_neighbourhood_info_df.drop('year', axis = 1, inplace = True)
In [76]:
all_neighbourhood_info_df
Out[76]:
sale_price_sqr_foot housing_units gross_rent
neighbourhood
Alamo Square 366.020712 378401.00 2817.285714
Anza Vista 373.382198 379050.00 3031.833333
Bayview 204.588623 376454.00 2318.400000
Bayview Heights 590.792839 382295.00 3739.000000
Bernal Heights 576.746488 379374.50 3080.333333
... ... ... ...
West Portal 498.488485 376940.75 2515.500000
Western Addition 307.562201 377427.50 2555.166667
Westwood Highlands 533.703935 376454.00 2250.500000
Westwood Park 687.087575 382295.00 3959.000000
Yerba Buena 576.709848 377427.50 2555.166667

73 rows × 3 columns

Step 3: Review the two code cells that concatenate the neighbourhood_locations_df DataFrame with the all_neighbourhood_info_df DataFrame.¶

Note that the first cell uses the Pandas concat function to create a DataFrame named all_neighbourhoods_df.

The second cell cleans the data and sets the “Neighbourhood” column.

Be sure to run these cells to create the all_neighbourhoods_df DataFrame, which you’ll need to create the geospatial visualisation.

In [77]:
# Using the Pandas `concat` function, join the 
# neighbourhood_locations_df and the all_neighbourhood_info_df DataFrame
# The axis of the concatenation is "columns".
# The concat function will automatially combine columns with
# identical information, while keeping the additional columns.
all_neighbourhoods_df = pd.concat(
    [neighbourhood_locations_df, all_neighbourhood_info_df], 
    axis="columns",
    sort=False
)

# Review the resulting DataFrame
display(all_neighbourhoods_df.head())
display(all_neighbourhoods_df.tail())
Lat Lon sale_price_sqr_foot housing_units gross_rent
Alamo Square 37.791012 -122.402100 366.020712 378401.0 2817.285714
Anza Vista 37.779598 -122.443451 373.382198 379050.0 3031.833333
Bayview 37.734670 -122.401060 204.588623 376454.0 2318.400000
Bayview Heights 37.728740 -122.410980 590.792839 382295.0 3739.000000
Bernal Heights 37.728630 -122.443050 NaN NaN NaN
Lat Lon sale_price_sqr_foot housing_units gross_rent
Yerba Buena 37.79298 -122.39636 576.709848 377427.5 2555.166667
Bernal Heights NaN NaN 576.746488 379374.5 3080.333333
Downtown NaN NaN 391.434378 378401.0 2817.285714
Ingleside NaN NaN 367.895144 377427.5 2509.000000
Outer Richmond NaN NaN 473.900773 378401.0 2817.285714
In [78]:
# Call the dropna function to remove any neighbourhoods that do not have data
all_neighbourhoods_df = all_neighbourhoods_df.reset_index().dropna()

# Rename the "index" column as "Neighbourhood" for use in the Visualisation
all_neighbourhoods_df = all_neighbourhoods_df.rename(columns={"index": "Neighbourhood"})

# Review the resulting DataFrame
display(all_neighbourhoods_df.head())
display(all_neighbourhoods_df.tail())
Neighbourhood Lat Lon sale_price_sqr_foot housing_units gross_rent
0 Alamo Square 37.791012 -122.402100 366.020712 378401.0 2817.285714
1 Anza Vista 37.779598 -122.443451 373.382198 379050.0 3031.833333
2 Bayview 37.734670 -122.401060 204.588623 376454.0 2318.400000
3 Bayview Heights 37.728740 -122.410980 590.792839 382295.0 3739.000000
5 Buena Vista Park 37.768160 -122.439330 452.680591 378076.5 2698.833333
Neighbourhood Lat Lon sale_price_sqr_foot housing_units gross_rent
68 West Portal 37.74026 -122.463880 498.488485 376940.75 2515.500000
69 Western Addition 37.79298 -122.435790 307.562201 377427.50 2555.166667
70 Westwood Highlands 37.73470 -122.456854 533.703935 376454.00 2250.500000
71 Westwood Park 37.73415 -122.457000 687.087575 382295.00 3959.000000
72 Yerba Buena 37.79298 -122.396360 576.709848 377427.50 2555.166667

Step 4: Using hvPlot with GeoViews enabled, create a points plot for the all_neighbourhoods_df DataFrame. Be sure to do the following:¶

  • Set the geo parameter to True.
  • Set the size parameter to “sale_price_sqr_foot”.
  • Set the color parameter to “gross_rent”.
  • Set the frame_width parameter to 700.
  • Set the frame_height parameter to 500.
  • Include a descriptive title.
In [79]:
# Create a plot to analyse neighbourhood info
all_neighbourhoods_df.hvplot.points(
    'Lon',
    'Lat',
    geo = True,
    size = 'sale_price_sqr_foot',
    color = 'gross_rent',
    frame_width = 700,
    frame_height = 500,
    tiles = True,
    #alpha = 0.3,
    hover_cols = ['Neighbourhood']
    
)
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(multi_line_string) > 1:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
  for line in multi_line_string:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(p_mline) > 0:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(multi_line_string) > 1:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
  for line in multi_line_string:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(p_mline) > 0:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(multi_line_string) > 1:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry.
  for line in multi_line_string:
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the  number of parts of a multi-part geometry.
  if len(p_mline) > 0:
Out[79]:

Step 5: Use the interactive map to answer the following question:¶

Question: Which neighbourhood has the highest gross rent, and which has the highest sale price per square foot?

Answer: The neighbourhood Westwood Park has the highest gross rent. The neighbourhood Union Square District has the highest sale price per square foot.

Compose Your Data Story¶

Based on the visualisations that you created, answer the following questions:

Question: How does the trend in rental income growth compare to the trend in sales prices? Does this same trend hold true for all the neighbourhoods across San Francisco?

Answer: Except for 2011, the sales prices showed a small increase from 2010 to 2016, while rental income growth has been showing strong growth. Although the growth rate of rental income growth slowed in 2011 and 2015, the rental income growth still increased from the previous year. The rental income growth maintained the same trend for all the neighbourhood across San Franciso. However, the sale price showed various state of volatility.

Question: What insights can you share with your company about the potential one-click, buy-and-rent strategy that they're pursuing? Do neighbourhoods exist that you would suggest for investment, and why?

Answer: With consistently rising rents and relatively stable housing prices, buying is a good strategy. However, since housing price fluctuations vary from region to region, choosing a quality area to buy becomes the key of the strategy.

Silver Terrace will be a good choice for buying, since it has relatively lower sale_price_sqr_foot value (much smaller point size in GeoViews) and higher gross_rent (much deeper point color in GeoViews). It is worth noting that among the dataset obtained, Silver Terrace only contains data for 2014, and more data needs to be collected to further verify the feasibility of the strategy.

In [80]:
sfo_data_df.head()
Out[80]:
year neighbourhood sale_price_sqr_foot housing_units gross_rent
0 2010 Alamo Square 291.182945 372560 1239
1 2010 Anza Vista 267.932583 372560 1239
2 2010 Bayview 170.098665 372560 1239
3 2010 Buena Vista Park 347.394919 372560 1239
4 2010 Central Richmond 319.027623 372560 1239
In [81]:
sfo_data_df[sfo_data_df.neighbourhood == 'Silver Terrace']
Out[81]:
year neighbourhood sale_price_sqr_foot housing_units gross_rent
280 2014 Silver Terrace 170.292549 380348 3528
In [ ]: